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ABSTRACT 

A variety of applications are emerging to support streaming 
video from mobile devices. However, many tasks can bene- 
fit from streaming specific content rather than the full video 
feed which may include irrelevant, private, or distracting 
content. We describe a system that allows users to capture 
and stream targeted video content captured with a mobile 
device. The application incorporates a variety of automatic 
and interactive techniques to identify and segment desired 
content in the camera view, allowing the user to publish a 
more focused video. 
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INTRODUCTION 

As the processing power of mobile devices improves, they 
are being used for more computationally intensive tasks. Ser- 
vices have begun to offer live video streaming from mo- 
bile devices (e.g., Qik [ 10]), potentially allowing anyone to 
stream any event anywhere at anytime. However, as we have 
seen in the desktop world, unfiltered streaming, while useful, 
is not appropriate for every task. By filtering streamed con- 
tent, presenters can better focus the audience's attention, im- 
prove bandwidth efficiency, and mitigate privacy concerns. 
Furthermore, mobile content capture faces challenges not 
present on desktops - for example the recording device may 
be off-axis from the desired content and may be hand-held 
and therefore unstable. 
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Figure 1. Screen detection. The application's first estimate shown 
in yellow (a) is inaccurate, but the second (b) correctly identifies the 
screen. When the user locks the ROI it changes color to red and the ap- 
plication displays a transparent image of the warped and cropped ROI 
in the lower- right, (c). As the user repositions the device, the applica- 
tion detects the motion and begins finding a new ROI, (d). The user can 
press a button to lock the ROI to the new guess, (e). 

The system we present, mVideoCast, helps filter and correct 
video captured and streamed from a mobile device (see Fig- 
ure [I]). The application can detect, segment, and stream con- 
tent shown on screens or boards, faces, or arbitrary regions. 
This can allow anyone to stream task- specific content with- 
out needing to develop hooks into external software (e.g., 
screen recorder software). While past work has supported 
streaming ROIs from video, mVideoCast is the first tool to 
do so live from a mobile device. In this paper we describe 
the system, algorithms we use to detect regions-of-interest 
(ROIs) in video frames, and our experiences with early pro- 
totypes of the application. 

MVIDEOCAST 

mVideoCast is architected as a client-server system in which 
mobile clients publish content to a remote machine. The 
mobile application, implemented on the Android platform, 
can both record captured media on the device locally as well 
as stream content to a remote server in real time, allowing 
remote users to view live content. Users can control when 
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Figure 2. Data flow. Audio (a) can be saved locally and streamed to a 
remote server. Video frames (b) can be saved, streamed, and used to 
detect ROIs. 



the application is recording content locally and when content 
is streaming live to a remote server. 

Capturing and correcting the image frame 

The mobile application extracts a quadrilateral ROI from 
each video frame, warping and cropping the frame to match 
a pre-defined output resolution and aspect ratio (see Figure 
[2]). The application runs three separate threads in order to 
save content locally, stream content to a remote server, and 
detect ROIs. The application first opens the camera and re- 
quests to receive preview frames. It stores received frames 
in a queue, and if it is in record mode it saves queued images 
locally. It then pushes frame events to the other two threads. 
The stream thread checks to see that it is in stream mode 
and that it has processed the last event. It then generates a 
compressed color image of the frame and posts it to a remote 
server (full images are sent, rather than only the ROI, to sup- 
port potential post hoc edits to the stream). The detection 
thread similarly checks that detection is enabled and that it 
has processed the last frame. It then generates a grayscale 
image and runs the screen detection algorithm (methods for 
defining the ROI are described in the next section). Once a 
candidate quadrilateral is found, it is set in the main thread 
and shown on the display as an overlay drawn over the video 
preview (see Figure [I]). 

Audio 

Because video frames undergo several processing steps and 
may not maintain a constant frame rate, audio is captured, 
saved, and streamed separately. While recording, captured 
audio is saved to a local file. While streaming, audio is 
Speex p2| encoded and forwarded to an Icecast (4) server 
running remotely. A web page associated with each user 
merges the audio with processed frames (see Figure [3]). 
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Figure 3. Viewing a stream. The server publishes a web page for each 
client that shows the cropped and warped ROI from the latest frame as 
well as the client's audio stream (disabled here). 

SELECTING A REGION 

Users can define a ROI using a variety of methods ranging 
from fully manual to fully automatic. 

Manual selection 

Users can set the ROI by tapping on the screen to set the 
four corners of the quadrilateral. The corner nearest the 
tapped point is set to the tapped location. There are also 
shortcuts to set certain standard sizes more quickly; clicking 
on the upper-left and lower-right corners of the quadrilateral 
in rapid succession define a rectangle, and double-tapping 
anywhere on the screen sets the entire preview frame as the 
captured region. 

Light tags 

If the user has control of the display he wants to capture, he 
can switch the mobile application to a mode that detects light 
tags (e.g., LEDs) attached to the display. In this mode the 
mobile application detects the bright points corresponding 
to the light tags to determine the corners of the ROI (see 
Figure [4]). 

Screen detection 

Users can enable a screen detection mode that will automati- 
cally attempt to determine screen regions within frames. We 
make use of the JJIL toolkit (5) in order to detect screens 
in a frame as follows: 1) Run a Canny edge detector over 
a grayscale version of the frame; 2) Remove all but the top 
5% most significant edges from the result; 3) Divide the re- 
maining points into four regions representing the top-half, 
bottom-half, left-half, and right-half of the image; 4) Send 
the subregions to a Hough line fit algorithm to find the domi- 
nant line in each subregion; 5) Construct a quadrilateral from 
the resulting lines. 

This approach is typically not immediately accurate, so while 
in screen detection mode the application presents the current 
ROI estimate to the user every few seconds (see Figure [T}. 
When the user is satisfied with a result, he can use a hard- 




Figure 4. Light tag detection. The mobile application can search for 
light tags (e.g., LEDs) attached to displays. 

ware button to "lock" the current ROI coordinates in place. 
When a user locks an ROI, the application displays a thumb- 
nailed view of the warped and cropped region in the lower- 
right of the screen. 

We anticipate that the user will move around while record- 
ing, and after moving the ROI will no longer be accurate. To 
address this issue, the mobile application continuously mon- 
itors the mobile device's onboard accelerometer and com- 
pass to detect a change in position. When it determines 
that the device has moved, it begins generating new poten- 
tial ROIs. While it generates guesses it also maintains the 
past ROI. When the user decides that a new estimate is a 
better match, he can click the hardware button to set it as the 
current ROI. 

The user can also adjust the corners of the quadrilateral at 
any time using the manual controls described above. 

Face detection 

In another mode the application uses Android's face detec- 
tor libraries to set the ROI. In this mode, the application au- 
tomatically updates the ROI with the location of the most 
salient face detected in each frame (see Figure [5]). In this 
mode the application does not adjust the output aspect ratio 
and only crops the frame. 

SCENARIOS 

mVideoCast can be used for a variety of tasks. 

Reporting: A technology reporter uses his mobile phone's 
camera to record the screen of a demonstration of new 
software at a local startup. mVideoCast lets him generate 
a clean, bandwidth efficient upstream for his followers to 
see live or asynchronously. 

A news reporter is interviewing a subject on the street. By 
using the face detection mode, mVideoCast allows the re- 
porter to easily stream a focused view of the interviewee, 
reducing distractors in the video and possibly preserving 
the privacy of bystanders. 

Remote demonstrations: A business user wants to present 
a demonstration of a software application while sitting at 
a cafe. He starts mVideoCast on his phone and can stream 
only his screen to the remote participants; the system will 
crop regions out of the screen such as the cafe surround- 
ings and his croissant. 
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Figure 5. Face tracking. The mobile application can set the ROI to be 
the most salient face in a frame. 

Sharing other mobile screens: A designer wants to doc- 
ument how a web page renders on his iPhone. He uses 
a second mobile device with mVideoCast to record his 
iPhone screen. The application picks up the iPhone screen 
boundaries in the video stream, unwarps it and streams it 
to remote viewers. 

Troubleshooting: Alice is trying to send a FAX using the 
her office's multifunction printer, but the machine stops 
unexpectedly while processing her paper. She decides to 
call support and puts her phone on loud-speaker, launches 
mVideoCast, and points her device at the printer's LCD 
screen. The software automatically detects the boundaries 
of the LCD screen, un-warps its image and sends it to a 
member of the support staff who can more easily view it 
and guide her to a solution. 

EXPERIENCE 

We asked four subjects to provide initial feedback on a pro- 
totype of the system running on an Android Nexus One de- 
vice. We asked them to imagine a scenario in which they 
were sending live video of their desktop monitor to a remote 
party. We explained the goal of the system was to help them 
automatically crop around the ROI, and that manual adjust- 
ments were possible by dragging the automatically selected 
edges to other locations or tapping the screen. 

User feedback provided several potential areas of improve- 
ment. Because many edges are present in the image of a 
typical computer desktop, the subjects often had to manu- 
ally readjust the area automatically selected by the system. 
This proved problematic, as dragging or touching corners 
with fingers would sometimes hide most of the screen and 
made people move the device's screen, causing irregulari- 
ties with the selected ROI. In general, holding the device 
steady was a challenge, and user's hands would naturally 
move when they had to look at the object they wanted to cap- 
ture - looking at the device's display was not enough. One 
user would always let the system guess the good tracking 
rectangles, never locking a guess by pressing the designated 
button. Another noticed that guesses where sometimes cor- 
rect, but he did not have time to press the button before the 
next guess happened. He suggested a "back" button to val- 
idate previous system guesses. Also, pressing the hardware 
button resulted in significant device motion, occasionally in- 
terfering with the ROI selection. A user suggested replacing 
this action by a touch in the middle of the detected area. 




Figure 6. Using image stabilization to improve screen detection. We 
are experimenting with optical flow techniques that can make minor 
adjustments to the original frames (top) in order to keep the ROI reg- 
istered without user intervention (bottom). 

User feedback also revealed new use cases for the tool. Three 
subjects had to step far back in order to capture their 30" 
monitor, which prompted them to try capturing individual 
windows within their desktop area. They actually expressed 
interest in this use case, capturing for example a login win- 
dow or a web browser window. One user said he would ac- 
tually like this system to capture an unwarped picture of a 
window, not necessarily a video recording. Another partici- 
pant spontaneously and successfully used the application to 
capture parts of his whiteboard. 

RELATED WORK 

While other research projects have explored video retarget- 
ing, or automatically selecting salient subregions of a video 
for redisplay on smaller screens such as mobile devices (7), 
mVideoCast uniquely allows users to stream specific ROIs 
from a mobile device. 

There have been a variety of applications that have used 
ROIs in video in non-mobile contexts. Researchers have in- 
vestigated user- and group-defined ROIs to control cameras 
for remote collaboration tasks (8][TTJ. Similarly, the Diver 
system allows users to create videos from cropped clips of a 
prerecorded, panoramic video (9). Other tools have explored 
automated solutions. El-Alfy et al. investigated automati- 
cally cropping surveillance videos to salient events [2]. An- 
other focus of past work is the removal of individuals from 
video recordings or video conference streams (such as (TJ). 

There is existing work concerning the detection of docu- 
ments in still images, including business card detection (3|. 
It is becoming more common for compact digital still cam- 
eras to include a document mode which will attempt to au- 
tomatically identify the outline of a document and allow the 
user to save a perspective-corrected version of the image. 
The iPhone application, JotNot (6), provides a similar quadri- 
lateral region crop-and-correct capability for still images, but 
requires manual selection of the target corners. 

CONCLUSION AND FUTURE WORK 

In the past remote communication suffered primarily from 
a lack of bandwidth. Today networked, mobile multimedia 
devices are ubiquitous, and the core challenge is not how to 
transmit more information but rather how to communicate 
the right information. mVideoCast is a small but important 
step toward this goal. 



use of optical motion compensation to maintain the align- 
ment of the ROI between explicit detections. The step of 
locking coordinates provides an anchor to which subsequent 
frames can be registered. That is, if the ROI is locked at 
frame 0, subsequent frames 1 ... N can be aligned to this 
reference frame with well known automatic image regis- 
tration techniques. For instance, we can compute a transfor- 
mation given a set of corresponding image coordinates, de- 
termined by matching image features (see Figure [6]). In this 
way, the initial lock can be used without losing the position 
due to camera motion. 
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We are working to improve mVideoCast in a number of ways. 
First, we are investigating new single-handed controls for 
setting the ROI manually. Furthermore, we are exploring the 



